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The vast amount of biological knowledge accumulated over the 
years has allowed researchers to identify various biochemical interac- 
tions and define different families of pathways. There is an increased 
interest in identifying pathways and pathway elements involved in 
particular biological processes. Drug discovery efforts, for example, 
are focused on identifying biomarkers as well as pathways related to 
a disease. We propose a Bayesian model that addresses this ques- 
tion by incorporating information on pathways and gene networks 
in the analysis of DNA microarray data. Such information is used 
to define pathway summaries, specify prior distributions, and struc- 
ture the MCMC moves to fit the model. We illustrate the method 
with an application to gene expression data with censored survival 
outcomes. In addition to identifying markers that would have been 
missed otherwise and improving prediction accuracy, the integration 
of existing biological knowledge into the analysis provides a better 
understanding of underlying molecular processes. 

1. Introduction. DNA microarrays have been used successfully to iden- 
tify gene expression signatures characteristic of disease subtypes [Golub et al. 
(1999)] or distinct outcomes to therapy [Shipp et al. (2002)]. Many statistical 
methods have been developed to select genes for disease diagnosis, prognosis 
and therapeutic targets. However, gene selection alone may not be sufficient. 
In cancer pharmacogenomics, for instance, cancer drugs are increasingly 
designed to target specific pathways to account for the complexity of the 
oncogenic process and the complex relationships between genes [Downward 
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(2006)]. Metabolic pathways, for example, are defined as a series of chemi- 
cal reactions in a living cell that can be activated or inhibited at multiple 
points. If a gene at the top of a signaling cascade is selected as a target, it 
is not guaranteed that the reaction will be successfully inactivated, because 
multiple genes downstream can still be activated or inhibited. Signals are 
generally relayed via multiple signaling routes or networks. Even if a branch 
of the pathway is completely blocked by inhibition or activation of multiple 
genes, the signal may still be relayed through an alternative branch or even 
through a different pathway [Bild et al. (2006)]. Downward (2006) pointed 
out that targeting a single pathway or a few signaling pathways might not 
be sufficient. Thus, the focus is increasingly on identifying both relevant 
genes and pathways. Genes and/or gene products generally interact with 
one another and they often function together concertedly. Here we propose 
a Bayesian model that addresses this question by incorporating informa- 
tion of pathway memberships and gene networks in the analysis of DNA 
microarray data. Such information is used to define pathway summaries, 
specify prior distributions, and structure the MCMC moves. 

Several public and commercial databases have been developed to struc- 
ture and store the vast amount of biological knowledge accumulated over 
the years into functionally or biochemically related groups. These databases 
focus on describing signaling, metabolic or regulatory pathways. Some ex- 
amples include Gene Ontology (GO) [Ashburner et al. (2000)], Kyoto En- 
cyclopedia of Genes and Genomes (KEGG) [Kanehisa and Goto (2000)], 
MetaCyc [Krieger et al. (2004)], PathDB, Reactome KnowledgeBase [Joshi- 
Tope et al. (2005)], Invitrogen iPath (www.invitrogen.com) and Cell Signal- 
ing Technology (GST) Pathway (www.cellsignal.com). The need to integrate 
gene expression data with the biological knowledge accumulated in these 
databases is well recognized. Several software packages that query pathway 
information and overlay DNA microarray data on pathways have been devel- 
oped. Nakao et al. (1999) implemented a visualization tool that color-codes 
KEGG pathway diagrams to reflect changes in their gene expression levels. 
GenMAPP [Dahlquist et al. (2002)] is another graphical tool that allows 
visualization of microarray data in the context of biological pathways or 
any other functional grouping of genes. Doniger et al. (2003) use GenMAPP 
to view genes involved in specific GO terms. Another widely used method 
that relates pathways to a set of differentially expressed genes is the gene 
set enrichment analysis (GSEA) [Subramanian et al. (2005)]. Given a list of 
genes, GSEA computes an enrichment score to reflect the degree to which a 
predefined pathway is over-represented at the top or bottom of the ranked 
list. These procedures are useful starting points to observe gene expression 
changes for known biological processes. 

Recent studies have gone a step further and focused on incorporating 
pathway information or gene-gene network information into the analysis 
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of gene expression data. For example, Park, Hastie and Tibshirani (2007) 
have attempted to incorporate GO annotation to predict survival time, first 
grouping genes based on their GO membership, calculating the first princi- 
pal component to form a super-gene within each cluster and then applying 
a Cox model with Li penalty to identify super-genes, that is, GO terms 
related to the outcome. Wei and Li (2007) have considered a small set of 
33 preselected signaling pathways and used the implied relationships among 
genes to infer differentially expressed genes, and Wei and Li (2008) have ex- 
tended this work by including a temporal dimension. Li and Li (2008) and 
Pan, Xie and Shen (2010) have proposed different procedures that use the 
gene-gene network to build penalties in a regression model for gene selec- 
tion. Bayesian approaches have also been developed. Li and Zhang (2010) 
have incorporated the dependence structure of transcription factors in a re- 
gression model with gene expression outcomes. There, a network is defined 
based on the Hamming distance between candidate motifs and used to spec- 
ify a Markov random field prior for the motif selection indicator. Telesca 
et al. (2008) have proposed a model for the identification of differentially 
expressed genes that takes into account the dependence structure among 
genes from available pathways while allowing for correction in the gene net- 
work topology. Stingo and Vannucci (2011) use a Markov random field prior 
that captures the gene-gene interaction network in a discriminant analysis 
setting. 

These methods use the gene-pathway relationships or gene network infor- 
mation to identify either the important pathways or the genes. Our goal is to 
develop a more comprehensive method that selects both pathways and genes 
using a model that incorporates pathway-gene relationships and gene depen- 
dence structures. In order to identify relevant genes and pathways, latent 
binary vectors are introduced and updated using a two-stage Metropolis- 
Hastings sampling scheme. The gene networks are used to define a Markov 
random field prior on the gene selection indicators and to structure the 
Markov chain Monte Carlo (MCMC) moves. In addition, the pathway infor- 
mation is used to derive pathway expression measures that summarize the 
group behavior of genes within pathways. In this paper we make use of the 
first latent components obtained by applying partial least squares (PLS) 
regressions on the selected genes from each pathway. PLS is an efficient 
statistical regression technique that was initially proposed in the chemo- 
metrics literature [Wold (1966)] and more recently used for the analysis of 
genomic and proteomic data; see Boulesteix and Strimmer (2007). We apply 
the model to simulated and real data using the pathway structure from the 
KEGG database. 

Our simulation studies show that the MRF prior leads to a better separa- 
tion between relevant and nonrelevant pathways, and to less false positives 
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in a model with fairly small regression coefficients. Other authors have re- 
ported similar results. Li and Zhang (2010), in particular, comment on the 
effect of the MRF prior on the selection power in their linear regression 
setting. They also notice that adding the MRF prior implies a relatively 
small increase in computational cost. Wei and Li (2007, 2008) report that 
their method is quite effective in identifying genes and modified subnetworks 
and that it has higher sensitivity than commonly used procedures that do 
not use the pathway structure, with similar and, in some cases, lower false 
discovery rates. Furthermore, in our model formulation we use the network 
information not only for prior specification but also to structure the MCMC 
moves. This is helpful for arriving at promising models faster by proposing 
relevant configurations. In real data applications the integration of pathway 
information may allow the identification of relevant predictors that could be 
missed otherwise, aiding the interpretation of the results, in particular, for 
the selected genes that are connected in the MRF, and also improving the 
prediction accuracy of selected models. 

The paper is organized as follows. Section 2 contains the model formula- 
tion and prior specification. Section 3 describes the MCMC procedure and 
strategies for posterior inference. In Section 4 performances are evaluated 
on simulated data and an application of the method to gene expression data 
with survival outcomes is presented. Section 5 concludes the paper with 
a brief discussion. 

2. Model specification. We describe how we incorporate pathway and 
gene network information into a Bayesian modeling framework for gene and 
pathway selection. Figure 1 represents a schematic representation of our 
approach and model. 

2.1. Regression on latent measures of pathway activity. Our goal is to 
build a model for identifying pathways related to a particular phenotype 
while simultaneously locating genes from these selected pathways that are 
involved in the biological process of interest. The data we have available for 
analysis consist of the following: 

(1) Y , an n X 1 vector of outcomes. 

(2) X, an n X p matrix of gene expression levels. Without loss of gener- 
ality, X is centered so that its columns sum to 0. 

(3) S, a K X p matrix indicating membership of genes in pathways, with 
elements Skj = 1 if gene j belongs to pathway k, and s^j = otherwise. 

(4) 'R, a pxp matrix describing relationships between genes, with rij = 1 
if genes i and j have a direct link in the gene network, and r^j = otherwise. 

The matrices S and R are constructed using information retrieved from 
pathway databases; see the application in Section 4.2 for details. 
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Fig. 1. Schematic representation of our proposed approach. Information on known path- 
ways and gene-gene networks is obtained from available databases. PLS components re- 
stricted to known pathways serve as possible regressors to predict a disease outcome, ac- 
cording to model (1). The goal of the inference is to identify the pathways to be included 
in the model and the genes to be included within those pathways. 

Since the goal of the analysis is to study the association between the 
response variable and the pathways, we need to derive a score as a measure of 
"pathway expression" that summarizes the group behavior of included genes 
within pathways. We do this by using the latent components from a PLS 
regression of Y on selected subsets of genes from each pathway. A number 
of recent studies have, in fact, applied dimension reduction techniques to 
capture the group behavior of multiple genes. Pittman et al. (2004), for 
instance, first apply fc-means clustering to identify subsets of potentially 
related genes, then use as regressors the first principal components obtained 
from applying principal component analysis (PCA) to each cluster. Bair 
et al. (2006) start by removing genes that have low univariate correlation 
with the outcome variable, then apply PCA on the remaining genes to form 
clusters or conceptual pathways, which are used as regressors. In our method, 
instead of attempting to infer conceptual pathways, we use the existing 
pathway information. We compute a pathway activity measure by applying 
PLS regression of y on a subset of selected genes from the pathway. PLS 
has the advantage of taking into account the covariance between regressors 
and the response variable Y, whereas PCA focuses solely on the variability 
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in the covariate data. The selection of a subset of gene expressions to form 
the PLS components is similar in spirit to the sparse PCA method proposed 
by Zou, Hastie and Tibshirani (2006), which selects variables to form the 
principal components. 

To identify both relevant groups and important genes, we introduce two 
binary vector indicators, a K x 1 vector 6 for the inclusion of the groups 
and a p X 1 vector 7 for the inclusion of genes, that is, 7^ = 1 if gene j is 
selected for at least one pathway score, and 7^ = otherwise. Assuming that 
the response Y is continuous, the linear regression model that relates Y to 
the selected pathways and genes is 

Ke 

(1) Y = la + Y,Tk{'y)l3k{y)+s, e^M{0,a^I), 

k=l 

where Kg = Ylk=i ^k is the number of selected pathways and where 7a:{7) cor- 
responds to the first latent PLS component generated based on the expres- 
sion levels of selected genes belonging to pathway k, that is, using the Xj's 
corresponding to s^j = 1 and 7j = 1. To be more precise, let pathway k 
contain pk = Yl'j=i ^kj genes and let p^-y = Yl^j=i ^kjlj denote the number 
of selected genes (i.e., genes included in the model) that belong to path- 
way k. Then Tfe('y) corresponds to the first latent PLS component generated 
by applying PLS to the expression data of the p^^, genes, denoted as X^(^), 

where Ui is the pk^ x 1 eigenvector corresponding to the largest eigenvalue 
of CxyCjy, with Cxy = cov(X;j(^),y) [see, e.g., Lindgren, Geladi and Wold 
(1993)]. Thus, Tfc(7) is an n x 1 vector and Pkil) is a scalar. Model (1) can 
therefore be seen as a PLS regression model with PLS components restricted 
to available pathways, and where the goal of the inference is to identify the 
pathways to be included in the model, and the genes to be included within 
those pathways. 

2.2. Models for categorical or censored outcomes. In the construction 
above, we have assumed a continuous response. However, our model for- 
mulation can easily be extended to handle categorical or censored outcome 
variables. 

When y is a categorical variable taking one of G possible values, 0, . . . , 
G — 1, a probit model can be used, as done by Albert and Chib (1993), Sha 
et al. (2004) and Kwon et al. (2007). Briefiy, each outcome Yi is associated 
with a vector {pifi, . . . ,pi^G-i)i where pig = P{Yi = g) is the probability that 
subject i falls in the gth. category. The probabilities pig can be related to the 
linear predictors using a data augmentation approach. Let Zj be latent data 
corresponding to the unobserved propensities of subject i to belong to one of 
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the classes. When the observed outcomes Yi correspond to nominal values, 
the relationship between Yi and Zj = {zi^i, . . . , Zi^c-i) can be defined as 



(2) 



' 0, if max \zii\ < 0, 
\<1<G~\ ' 

q, if max {zn} > and Zi a = max izn}. 

KKG-l KKG-1 



A multivariate normal model can then be used to associate Zj to the pre- 
dictors 

Kg 

(3) Z, = la + J2T^M'f)(^H^) + ^i^ e,~A/'(0,S),f = l,...,n. 

k=l 

If the observed outcomes Yi correspond, instead, to ordinal categories, 
the latent variable Zi is defined such that Yi = g if 5g < Zi < g = 

0,. . . ,G — 1, where the boundaries 6g are unknown and —oo = 6q<6i< 
■ ■ ■ < 6c-i < 5a = oo. The latent variable Zi is associated with the predictors 
through the linear model 



(4) = a + ^rj^fc(^)/3fc(^) +ei, ~7\A(0,(T^),i = 1, 



fc=i 



,n. 



For censored survival outcomes, an accelerated failure time (AFT) model 
can be used [Wei (1992); Sha, Tadesse and Vannucci (2006)]. In this case, 
the observed data are Yi = min(Tj, Cj) and 5i = I{Yi < Cj}, where Tj is the 
survival time for subject i, Ci is the censoring time, and 5i is a censoring 
indicator. A data augmentation approach can be used and latent variables Zi 
can be introduced such that 



(5) 



Z, = \og{Yi), if<5i = l, 
Z,>log(yi), ii5i = Q. 



The AFT model can then be written in terms of the latent Zi similarly to (4) 
where the e^'s are independent and identically distributed random variables 
that may take one of several parametric forms. Sha, Tadesse and Vannucci 
(2006) consider cases where £i follows a normal or a t-distribution. 

2.3. Prior for regression parameters. The regression coefficient (5^ in (1) 
measures the effect of the PLS latent component summarizing the effect of 
pathway k on the response variable. However, not all pathways are related 
to the phenotype and the goal is to identify the predictive ones. Bayesian 
methods that use mixture priors for variable selection have been thoroughly 
investigated in the literature, in particular, for linear models; see George and 
McCulloch (1997) for multiple regression. Brown, Vannucci and Fearn (1998) 
for extensions to multivariate responses and Sha et al. (2004) for probit mod- 
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els. A comprehensive review on features of the selection priors and on compu- 
tational aspects of the method can be found in Chipman, George and McCul- 
loch (2001). Similarly, we use the latent vector 6 to specify a scale mixture 
of a normal density and a point mass at zero for the prior on each in (1): 

(6) (3k\ek,a^^ek-M{Po,ha^) + {l-ek)-6o{(3k), k = l,...,K, 

where do{Pk) is a Dirac delta function. The hyperparameter h in (6) regu- 
lates, together with the hyperparameters of p{0,^\rj) defined in Section 2.4 
below, the amount of shrinkage in the model. We follow the guidelines pro- 
vided by Sha et al. (2004) and specify h in the range of variability of the data 
so as to control the ratio of prior to posterior precision. For the intercept 
term, a, and the variance, ci^, we take conjugate priors a\a'^ ~ AA(ao; ^o^^) 
and (T^ ~ Inv-Gamma(z/o/2, i'oCo/2), where ao, /3o, ho, h, z/q and Uq are to 
be elicited. 



(7) 



2.4. Priors for pathway and gene selection indicators. In this section we 
define the prior distributions for the pathway selection indicator, 0, and gene 
selection indicator, 7. These priors are first defined marginally then jointly, 
taking into account some necessary constraints. 

Each element of the latent ET- vector 9 is defined as 

1, if pathway k is represented in the model, 
0, otherwise 

for k = l,. . . ,K. We assume independent Bernoulli priors for the ^fc's, 

K 

(8) p{e\^u) = J[4'{^-Vkf"'\ 

k=l 

where ipk determines the proportion of pathways expected a priori in the 
model. A mixture prior can be further specified for ip^ to achieve a better 
discrimination in terms of posterior probabilities between significant and 
nonsignificant pathways by inflating p{9k = 0) toward 1 for the nonrelevant 
pathways, as first suggested by Lucas, Carvalho, Wang, Bild, Nevins and 
West (2006), 

(9) P{^k) =pSo{^k) + (1 - p)S{ipk\ao,bo), 

where B{(pk\aQ,b()) is a Beta density function with parameters oq and 60. 
Since inference on ipk is not of interest, it can be integrated out to simplify 
the MCMC implementation. This leads to the following marginal prior for 6: 

B{aQ + 9kM + ^-0k)' 



p.{i-ek) + {i-p) 



(10) p{e) = \[ , . 

where B{-,-) is the Beta function. Prior (10) corresponds to a product of 
Bernoulli distributions with parameter ip^ = . 
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For the latent vector 7 we specify a prior distribution that is able to take 
into account not only the pathway membership of each gene but also the 
biological relationships between genes within and across pathways, which 
are captured by the matrix R. Following Li and Zhang (2010), we model 
these relations using a Markov random field (MRF), where genes are repre- 
sented by nodes and relations between genes by edges. A MRF is a graphical 
model in which the distribution of a set of random variables follow Markov 
properties that can be described by an undirected graph. In particular, two 
unconnected genes are considered conditionally independent given all other 
genes [Besag (1974)]. Relations on the MRF are represented by the following 
probabilities: 



where F^jj) = (^ + ^X^iga^jT*)) direct neighbors of 

gene j in the MRF using only pathways represented in the model, that is, 
pathways with 0^ = 1. The corresponding global distribution on the MRF is 
given by 



with Ip the unit vector of dimension p and R the matrix introduced in 
Section 2.1. The parameter fj, controls the sparsity of the model, while r] 
regulates the smoothness of the distribution of 7 over the graph by con- 
trolling the prior probability of selecting a gene based on how many of its 
neighbors are selected. In particular, higher values of rj encourage the selec- 
tion of genes with neighbors already selected into the model. If a gene does 
not have any neighbor, then its prior distribution reduces to an indepen- 
dent Bernoulli with parameter p = exp(;Li)/[l -|- exp(/.i)], which is a logistic 
transformation of /i. 

Here, unlike Li and Zhang (2010), who fix both parameters of the MRF 
prior, we specify a hyperprior for 1]. We give positive probability to values 
of r] bigger than 0, which is biologically more intuitive than negative values of 
this parameter (which would favor neighboring genes to have different inclu- 
sion status). Such restriction on the domain of rj also minimizes the "phase 
transition" problem that typically occurs with MRF parameterizations of 
type (11), where the dimension of the selected model increases massively 
for small increments of rj. When the phase transition occurs the number of 
selected genes increases substantially. Here, after having detected the phase 
transition value rjpT, by simulating from (12) over a grid of rj values, we 
specify a Beta distribution Beta(co,(io) on r]/r]PT- 

Constraints need to be imposed to ensure both interpretability and iden- 
tifiability of the model. We essentially want to avoid the following: 

(1) empty pathways, that is, selecting a pathway but none of its member 
genes; 



(11) 





Phl^, fJ', « exp(/il' 7 + r/7'R7) 
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(2) orphan genes, that is, selecting a gene but none of the pathways that 
contain it; 

(3) selection of identical subsets of genes by different pathways, a situ- 
ation that generates identical values and Ti^'{-y) to be included in the 
model. 

These constraints imply that some combinations of 6 and -f values are not 
allowed. The joint prior probability for (0,7) taking into account these con- 
straints is given by 

' K 

n Vk' (1 - Vl)'-'" exp(/.i;7 + VJ'^I), 
piOnlv) < k=l 

for valid configurations, 
^ 0, for invalid configurations. 

3. Model fitting. We now describe our MCMC procedure to fit the model 
and discuss strategies for posterior inference with huge posterior spaces, as 
in this model. In the Bayesian literature on variable selection for standard 
linear regression models stochastic search algorithms have been designed to 
explore the posterior space, and have been successfully employed in genomic 
applications with prohibitive settings, handling models with thousands of 
genes. A key to these applications is the assumption of sparsity of the model, 
that is, the belief that the response is associated with a small number of 
regressors. A stochastic search then allows one to explore the posterior space 
in an effective way, quickly finding the most probable configurations, that is, 
those corresponding to coefficients with high marginal probabilities, while 
spending less time in regions with low posterior probability. 

We describe below the MCMC algorithm we have designed for our prob- 
lem. In particular, borrowing from the literature on stochastic searches for 
variable selection, we work with a marginalized model and design a Metro- 
polis-Hastings algorithm that updates the indicator parameters for the in- 
clusion of pathways and genes with a set of moves that add and/or delete 
a single gene and a single pathway. Also, we update the parameter t] of 
the MRF from its posterior distribution by employing the general method 
proposed by M0ller et al. (2006). In Stingo et al. (2011) we discuss how 
our Bayesian stochastic search variable selection kernel generates an ergodic 
Markov chain over the restricted space. In applications, we have found that 
a good way to asses if the stochastic exploration can be considered satisfac- 
tory is to check the concordance of the posterior probabilities obtained from 
different chains started from different initial points. 

3.1. Marginal posterior probabilities. The model parameters consist of 
(a,/3, £7^,')', 0,77). The MCMC procedure can be made more efficient by 
integrating out some of the parameters. Here, we integrate out the regression 
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parameters, a, (3 and o"^. This leads to a multivariate t-distribution 

(13) /(y|T,0,7)~r.o(aol„+T(e,^)/3o,cTg(I„+/iolni:.+T(,,^)SoT'(e^^))), 

with vq degrees of freedom and 1„ an n-vector of ones, and where Sq = Klxg , 
with In the n X n identity matrix, and ^(6,^) the n x Kg matrix derived 
from the first PLS latent components for the selected pathways using the 
selected genes. In the notation (13) the two arguments of the t-distribution 
represent the mean and the scale parameter of the distribution, respectively. 
The posterior probability distribution of the pathway and gene selection 
indicators is then given by 

(14) /(0,7,r?|T,y) « f{Y\T,9,j)-p{e,-f\7])-p{ij). 

3.2. MCMC sampling. The MCMC steps consist of the following: (I) 
sampling pathway and gene selection indicators from j>(0, 7|rest); (II) sam- 
pling the MRF parameter from p(r/|rest); (III) sampling additional parame- 
ters introduced when fitting probit models for categorical outcomes or AFT 
models for survival data. 

(I) The parameters (^,7) are updated using a Metropolis-Hastings algo- 
rithm in a two-stage sampling scheme. The pathway-gene relationships 
are used to structure the moves and account for the constraints spec- 
ified in Section 2.4. Details of the MCMC moves to update (^,7) are 
given in Stingo et al. (2011). They consist of randomly choosing one 
of the following move types: 

(1) change the inclusion status of gene and pathway by randomly 
choosing between adding a pathway and a gene or removing them 
both; 

(2) change the inclusion status of gene but not pathway by randomly 
choosing between adding a gene or removing a gene; 

(3) change the inclusion status of pathway but not gene by randomly 
choosing between adding a pathway or removing a pathway. 

(II) At this step we want to draw the MRF parameter r/ from the posterior 
density p(r/|7) p{ri)p['j\'q) . The prior distribution on 7 is of the form 

(15) p(7|r/)=(Z,(7)/^, 

with unnormalized density q-qil) and a normalizing constant which 
is not available analytically. When calculating the Metropolis-Hastings 
ratio to determine the acceptance probability of a new value rf , 

with r]° the current value for r], one needs to take into account that 
Z^p/Z^o ^ 1. Following M0ller et al. (2006), we introduce an auxiliary 
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variable w, defined on the same state space as that of 7, which has 
conditional density f{'w\rj,'-f), and consider the posterior p{r],w\j) oc 
f{w\r],j)p{r])qrj{'y)/Zrj, which of course still involves the unknown Z^. 
Obviously, marginalization over w oi p{r],w\'y) gives the desired dis- 
tribution p{'i]\'y)- Now, if {r]°,w°) is the current state of the algo- 
rithm, we first propose tjP with density (/(f?^!??"), then with den- 
sity q{w'P\w° ,rf ,7]°). As usual, the choice of these proposal densities 
is arbitrary from the point of view of the equilibrium distribution of 
the chain of r/ values. The choice of f{w\r],'f) is also arbitrary. The 
key idea of the method proposed by M0ller et al. (2006) is to take the 
proposal density for the auxiliary variable w to be of the same form 
as (15), but dependent on rf rather than ry°, that is, 

(17) q{wP\w",riP,r^°)=p{w^W)=qr,r>{wn/Zr,v. 

Then the Metropolis-Hastings ratio becomes 

/(u;P|r/P, 7)p(r/P)g^P (7)9^0 (u;°)g(77° |r/P) 



(18) H{r]P,wP\r]°,w'' 



fiw°\r]°,j)p{r]°)qr^o (7)5^? {wP)q{T]P\r]° 



and no longer depends on /Z^^o. The new value for the auxiliary 
variable w is drawn from (17) by perfect simulation using the algorithm 
proposed by Propp and Wilson (1996). 
(Ill) In the case of classification or survival outcomes, the augmented data Z 
need to be updated from their full conditionals using Gibbs sampling; 
see Sha et al. (2004), Sha, Tadesse and Vannucci (2006) and Kwon 
et al. (2007) for details. 



3.3. Posterior inference. The MCMC procedure results in a list of vis- 
ited models with included pathways indexed by 6 and selected genes indexed 
by 7, and their corresponding relative posterior probabilities. Pathway selec- 
tion can be based on the marginal posterior probabilities p{9k\T,Y). A sim- 
ple strategy is to compute Monte-Carlo estimates by counting the number of 
appearances of each pathway across the visited models. Relevant pathways 
are identified as those with largest marginal posterior probabilities. Then 
relevant genes from these pathways are identified based on their marginal 
posterior probabilities conditional on the inclusion of a pathway of inter- 
est, p{'yj\T,Y,I{6kSkj = !})■ An alternative inference for gene selection is 
to focus on a subset of pathways, V, and consider the marginal posterior 
probability conditional on at least one pathway the gene belongs to be- 
ing represented in the model, p(7j|T, Y", /{^^gp^feS^j > 0}). We note that 
Rao-Blackwellized estimates have been employed in standard linear regres- 
sion models, in place of frequency estimates, by averaging the full condi- 
tional posterior probabilities of the inclusion indicators. These estimates are 
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computationally quite expensive, though they may have better precision, as 
noted by Guan and Stephens (2011). Because of our strategy for inference, 
that selects first pathways and then genes conditional on selected pathways, 
Rao-Blackwellized estimates of marginal probabilities may not be straight- 
forward to derive. In all simulations and examples reported in this paper we 
have obtained satisfactory results by simply estimating the marginal poste- 
rior probabilities with the corresponding relative frequencies of inclusion in 
the visited models. 

Inference for a new set of observations, (X.f,Yf), can be done via least 
squares prediction, Yj = l^a + Tj(g where T^^g ,^) is the first prin- 

cipal component based on selected genes from relevant pathways and where 
a = Y and = (T|g ,^-)T(e^^) + /i~^IxJ~^T|g ,^^1", with Y the response 

variable in the training and T(g^) the scores obtained from the training 
data using selected pathways and genes included in the model. Note that 
for prediction purposes, since we do not know the future Yf, a PLS regression 
cannot be fit. Therefore, we generate 7/(0,7) by considering the first latent 
component obtained by applying PCA to each selected pathway using the 
included genes. 

In the case of categorical or censored survival outcomes, the sampled 
latent variables Z would be used to estimate Zf, then the correspondence 
between Z and the observed outcome outlined in Section 2.2 can be invoked 
to predict Yf [Sha et al. (2004, 2006); Kwon et al. (2007)]. 

4. Application. We assess performances on simulated data, then illus- 
trate an application to microarrays using the KEGG pathway database to 
define the MRF. 

4.1. Simulation studies. We investigated the performance of our model 
using simulated data based on the gene-pathway relations, S, and gene net- 
work, R, of 70 pathways and 1,098 genes from the KEGG database. The 
relevant pathways were defined by selecting 4 pathways at random. For each 
of the 4 selected pathways, one gene was picked at random and its direct 
neighbors that belong to the selected pathways were chosen. This resulted 
in the selection of 4 pathways and 15 genes: 7 out of 30 from the first 
pathway, 3 out of 35 from the second, 3 out of 105 from the third, and 2 
out of 47 from the fourth pathway. Gene expressions for n = 100 samples 
were simulated for these 15 genes using an approach similar to Li and Li 
(2008). This was accomplished by first creating an ordering among the 15 
selected genes by changing the undirected edges in the gene networks into 
directed edges. The first node on the ordering, which we denote by Xpi , was 
selected from each pathway and drawn from a standard normal distribu- 
tion; note that this node has no parents. Then all child nodes directly con- 
nected only to Xp-^ and denoted by Xp^ were drawn from Xp^ M{Xp-^p, 1). 
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Subsequent child nodes at generation j, Xp., were drawn using all parents 
from ~ A/'(/oXpQ(^^)l|pQ(^^)|, 1), where pa{Fj) indicates the set of par- 
ents of node j and Xp(i(^^.) is a matrix containing the expressions of all the 
\pa{Fj)\ parents for node j. The expression levels of the remaining 1,073 
genes deemed irrelevant were simulated from a standard normal density. 
The response variables for the n = 100 samples were generated from 

15 

Yi = ^Xij^ + e^, ei~AA(0,l),i = l,...,100. 
i=i 

For the first data set we set /? = ±0.5, with the same sign for genes belonging 
to the same pathways. For the second and third data sets we used /3 = ±1 
and (3 = ±1.5, respectively. Note how the generating process is different from 
model (1) being fit. 

We report results obtained by choosing, when possible, hyperparameters 
that lead to weakly informative prior distributions. A vague prior is assigned 
to the intercept a by setting /iq to a large value tending to cxd. For cr^, the 
shape parameter can be set to = 3, the smallest integer such that the 
variance of the inverse-gamma distribution exists, and the scale parame- 
ter foO"Q/2 can be chosen to yield a weakly informative prior. For the vector 
of regression coefficients, we set the prior mean to /3o = and choose h 
in the range of variability of the covariates, as suggested in Section 2.3. 
Specifically, we set ho = 10^, qq = /3o = 0) t^o^'o/'^ = 0.5, and h = 0.02. For 
the pathway selection indicators, 9k, we set ifl = 0.01. As for the prior at the 
gene level, we set fi = —3.5, corresponding to setting the proportion of genes 
expected a priori in the model to, at least, 3% of the total number of genes. 
Parameters (fl and /i influence the sparsity of the model and consequently 
the magnitude of the marginal posterior probabilities. Some sensitivity is, 
of course, to be expected. However, in our simulations we have noticed that 
the ordering of pathways and genes based on posterior probability remains 
roughly the same and, therefore, the final selections are unchanged as long 
as one adjusts the threshold on the posterior probabilities. Also, for the hy- 
perprior on rj, we set riPT = 0.092, to avoid the phase transition problem, 
and Co = 5 and do = 2, to obtain a prior distribution that favors bigger val- 
ues of r] in the interval < rj < rjpp. In our simulations we did not notice 
sensitivity to the specification of cq and do- 

The MCMC sampler was run for 300,000 iterations with the first 50,000 
used as burn-in. We computed the marginal posterior probabilities for pathway 
selection, p{Ok = 1\Y,T), and the conditional posterior probabilities for gene 
selection given a subset of selected pathways, p{'yj\T, Y, I{YlkeV ^kSkj > 0}) . 
Figure 2 displays the marginal posterior probabilities of inclusion for all 70 
pathways and the conditional posterior probabilities of inclusion for all 1,098 
genes. 
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Fig. 2. Simulated data: Marginal posterior probabilities for pathway selection, 
p{6k\'T,Y), and conditional posterior probabilities for gene selection, p(7j|T,Y", 
^{X^fcg-p ^fc'^fcj > 0})' /"^ three simulated data sets. Open circles indicate pathways and 
genes used to generate the outcome variable. 



Important pathways and genes can be selected as those with highest pos- 
terior probabiUties. For example, in all 3 scenarios all four relevant pathways 
were selected with a marginal posterior probability cutoff of 0.8. Reducing 
the selection threshold to a marginal posterior probability of 0.5 pulls in 
two false positive pathways, for all the three simulated scenarios considered. 
One of the false positives is the pathway with index 17 in Figure 2, which 
contains more than 100 genes. A closer investigation of the MCMC output 
reveals that different subsets of its member genes are selected whenever it 
is included in the model, resulting in a high marginal posterior of inclusion 
for the pathway but low marginal posterior probabilities for all its mem- 
ber genes. The second false positive pathway appears to be selected often 
because it contains two or three of the relevant genes that were used to 
simulate the response variable and were also included in the model with 
high marginal posterior probabilities; all its other member genes have very 
low probabilities of selection. As expected, the identification of the relevant 
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genes is easier when the signal-to-noise ratio is higher. Conditional upon the 
best 4 selected pathways, a marginal posterior probability cutoff of 0.5 on 
the marginal probability of gene inclusion leads to the selection of 7, 8 and 8 
relevant genes, for the three scenarios, respectively, and no false positives. 
With a marginal probability threshold of 0.1, 14 of the relevant genes are se- 
lected with 4 false positives for the scenario with (3 = ±0.5, while 13 relevant 
genes are selected with only two false positives for the simulated data with 
/3 = ±1. In the simulated setting with /3 = ±1.5 all the 15 relevant genes are 
selected without any false positive at a threshold of 0.12. 

Generally speaking, the effect of the MRF prior depends on the concor- 
dance of the prior network with the data. For the simulated data, we found 
that the model with the MRF prior, compared to the same model without the 
MRF, performs better in terms of pathway selection, as it provides a clearer 
separation between relevant and nonrelevant pathways. In particular, the 
average difference, over the three scenarios, between the relevant pathway 
with the lowest posterior probability and the nonrelevant pathway with the 
highest posterior probability is 0.28, while without the MRF prior it is only 
0.18. In addition, we have observed increased sensitivity of the MRF prior in 
selecting the true variables. For example, for the simulated case with /3± 1.5, 
in order to select all 15 relevant genes, the marginal probability cutoff must 
be reduced to 0.088 at the expense of including 3 false positives. Other au- 
thors have reported similar results [Li and Zhang (2010)]. In the real data 
application we describe below, employing information on gene-gene net- 
works aids the interpretation of the results, in particular, for those selected 
genes that are connected in the MRF, and improves the prediction accuracy. 

4.2. Application to microarray data. We consider the van't Veer et al. 
(2002) breast cancer microarray data.^ Gene expression measures were col- 
lected on each patient using DNA microarray with 24,481 probes. Missing 
expressions were imputed using a fc-nearest neighbor algorithm with k = 10. 
The procedure consists of identifying the k closest genes to the one with 
missing expression in array j using the other n — 1 arrays, then imputing 
the missing value by the average expression of the k neighbors [Troyanskaya 
et al. (2001)]. We focus on the 76 sporadic lymph-node-negative patients, 33 
of whom developed distant metastasis within 5 years; the remaining 43 are 
viewed as censored cases. We randomly split the patients into a training set 
of 38 samples and a test set of the same size using a fairly balanced split of 
metastatic/nonmetastatic cases. The goal is to identify a subset of pathways 
and genes that can predict time to distant metastasis. 

The gene network and pathway information were obtained from the KEGG 
database. This was accomplished by mapping probes to pathways using the 



^Available at www.rii.com/publications/2002/vantveer.htm. 



A BAYESIAN MODEL FOR PATHWAY AND GENE SELECTION 



17 



links between pathway node identifiers and LocusLink ID. Using the R pack- 
age KEGGgraph [Zhang and Wiemann (2009)], we first downloaded the gene 
network for each pathway, then merged all networks into a single one with 
all genes. A total of 196 pathways and 3,592 probes were included in the 
analysis, with each pathway containing multiple genes and with most genes 
associated with several pathways. 

We ran two MCMC chains with different starting numbers of included 
variables, 50 and 80, respectively. We used 600,000 iterations with a burn-in 
of 100,000 iterations. We incorporated the first latent vector of the PLS for 
each pathway into the analysis as described in Section 2.1 and set the number 
of pathways expected a priori in the model to 10% of the total number. For 
the gene selection, we set the hyperparameter of the Markov random field 
to ;U = —3.5, indicating that a priori at least 3% of genes are expected to 
be selected. We set rjpT = 0.09, to avoid the phase transition problem, and 
Co = 1 and do = 1, to obtain a noninformative prior distribution. A sensitivity 
analysis showed that the posterior inference is not affected by different values 
of Co and do- We set ao = /^o = 0, ho = 10^ and h = 0.1 for the prior on the 
regression parameters and obtained a vague prior for cr^ by choosing uo/2 = 3 
and fo<7o/2 = 0.5. 

The trace plots for the number of included pathways and the number of 
selected genes showed good mixing (figures not shown). The MCMC sam- 
plers mostly visited models with 20-45 pathways and 50-90 genes. To assess 
the agreement of the results between the two chains, we looked at the cor- 
relation between the marginal posterior probabilities for pathway selection, 
p{9k\T,Y), and found good concordance between the two MCMC chains 
with a correlation coefficient of 0.9933. Concordance among the marginal 
posterior probabilities was confirmed by looking at a scatter plot of p{6k\T, Y) 
across the two MCMC chains (figure not shown). 

The model also showed good predictive performance. Sha, Tadesse and 
Vannucci (2006) already analyzed these data using an AFT model with 3,839 
probes as predictors and obtained a predictive MSE of 1.9317 using the 11 
probe sets with highest marginal probabilities. Our model incorporating 
pathway information achieved a predictive MSE of 1.4497 on the validation 
set, using 12 selected pathways and 41 probe sets with highest posterior 
probabilities. The selected pathways and genes are clearly indicated in the 
marginal posterior probability plots displayed in Figure 3. If we increase the 
marginal probability thresholds for selection and consider a model with 7 
selected pathways and 14 genes, to make the comparison more fair with the 
results of Sha, Tadesse and Vannucci (2006), we obtain a MSE of 1.7614. 
As a reminder, our model selects relevant pathways and relevant genes si- 
multaneously, while the model of Sha, Tadesse and Vannucci (2006) selects 
genes only. Of course, one can always select pathways post-hoc, as those 
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Fig. 3. Microarray data: Plot (a,): Marginal posterior probabilities for pathway selection, 
p{6k\'T,Y). The 12 pathways with largest probabilities are marked with symbols. Plot (h): 
Conditional posterior probabilities for gene selection, p(7j |T, V", /{5]]j,gp ^feSfcj > 0}). The 
41 probes with largest probability that belong to the 12 selected pathways in plot (a) are 
marked with A. 
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Table 1 

The 41 selected genes divided by islands and with associated pathway indices 

(in parenthesis) 



Singleton genes (no direct neighbor selected) 

ACACB (10), C4A (8, 12), CALMl (10), CCNB2 (5), CD4 (7), CDC2 (5), CLDNll 
(7), FZD9 (11), GYS2 (10), HIST1H2BN (12), IFNA7 (3), NFASC (7), NRCAM (7), 
PCKl (10), PFKP (10), PPARGCIA (10), PXN (9) 

Island 1 

ACTB (9), ACTGl (9), ITGAl (9), ITGA7 (9), ITGB3 (9), ITGB4 (9), ITGB6 (9), 
ITGB8 (7, 10), MYL5 (9), MYL9 (9), PDPKl (10), PIK3CD (9, 10, 11), PLA2G4A 
(2), PLCGl (11), PRKCA (2, 11), PRKY (2, 10), PRKY (2, 10), PTGS2 (11), SOCS3 
(10) 

Island 2 

ACVRIB (2, 3, 11), ACVRIB (2, 3, 11), TGFB3 (2, 3, 5, 11) 

Island 3 

ENTPD3 (1), GMPS (1) 

Notes: The pathway indices correspond to the following: 1-Purine metabolism, 2- 
MAPK signaling pathway, 3-Cytokine-cytokine receptor interaction, 4-Neuroactive ligand- 
receptor interaction, 5-Cell cycle, 6- Axon guidance, 7-Cell adhesion molecules (CAMs), 
8-Complement and coagulation cascades, 9-Regulation of actin cytoskeleton, 10-Insulin 
signaling pathway, 11-Pathways in cancer, 12-Systemic lupus erythematosus. 

that contain the selected genes. However, as single genes belong to multiple 
pathways, we expect our approach to give a more precise selection. 

From a practical point of view, researchers can use the posterior probabil- 
ities produced by our selection algorithm as a way to prioritize the relevant 
pathways and genes for further experimental work. For example, the genes 
corresponding to the best 41 selected probe sets, conditional upon the best 
12 selected pathways, are listed in Table 1 divided by islands, which corre- 
spond to sets of connected genes in the Markov random field. The islands 
help with the biological interpretation by locating relevant branches of path- 
ways. A subset of the selected pathways along with islands and singletons are 
displayed in Figure 4. Several of the identified pathways are involved in tu- 
mor formation and progression. For instance, the mitogen-activated protein 
kinase (MAPK) signaling pathway, involved in various cellular functions, 
including cell proliferation, differentiation and migration, has been impli- 
cated in breast cancer metastasis [Lee et al. (2007)]. The KEGG pathway 
in cancers was also selected with high posterior probability. Other inter- 
esting pathways are the insulin signaling pathway, which has been linked 
to the development, progression and outcome of breast cancer, and purine 
metabolism, involved in nucleotide biosynthesis and affects cell cycle activity 
of tumor cells. 

In addition, several genes with known association to breast cancer were 
also selected. Protein kinase C alpha (PKCA), which belongs to the MAPK 
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Fig. 4. Microarray data: Graphical representation of a subset of selected pathways with 
islands and singletons. The genes in the islands are listed in Table 1. 

pathway and the KEGG pathways in cancer, has been reported to play 
roles in many different cellular processes, including cell functions associ- 
ated with breast cancer progression. It has been shown to be overexpressed 
in some antiestrogen resistant breast cancer cell lines and to be involved 
in the growth of tamoxifen resistant human breast cancer cells [Frankel 
et al. (2007)]. Patients with PKCA-positive tumors have been shown to have 
worse survival than patients with PKCA-negative tumors, independently of 
other factors [L0nne et al. (2010)]. Prostaglandin-endoperoxide synthase-2 
(PTGS2, also known as cyclooxygenase-2 or C0X2) has also been related 
to breast cancer. Denkert, Winzer and Hauptmann (2004) observed C0X2 
overexpression in breast cancer and strong association with indicators of 
poor prognosis, such as lymph node metastasis, poor differentiation and 
large tumor size. This was further confirmed by Gupta et al. (2007), who 
showed that the expression of C0X2 in human breast cancer cells facilitates 
the assembly of new tumor blood vessels, the release of tumor cells into the 
circulation, and the breaching of lung capillaries by circulating tumor cells 
to seed pulmonary metastasis. This is an important finding, as the major- 
ity of breast cancer deaths result from metastases rather than direct effects 
of the primary tumor. Another gene previously shown to be predictive of 
breast cancer lung metastasis is integrin, beta-8 (ITGB8) [Landemaine et al. 
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(2008)]. We also identified integrin, beta-4 (ITGB4) which regulates key sig- 
naling pathways related to carcinoma progression, and is linked to aggressive 
tumor behavior and poor prognosis in certain breast cancer subtypes [Guo 
et al. (2006)]. 

5. Discussion. We have proposed a model that incorporates biological 
knowledge from pathway databases into the analysis of DNA microarrays to 
identify pathways and genes related to a phenotype. Information on pathway 
membership and gene networks are used to define pathway summaries, spec- 
ify prior distributions that account for the dependence structure between 
genes, and define the MCMC moves to fit the model. The gene network 
prior and the synthesis of the pathway information through PLS bring in 
additional information that is especially useful in microarray data, due to the 
low sample size and large measurement error. Performances of the method 
were evaluated on simulated data and a breast cancer gene expression study 
with survival outcomes was used to illustrate its application. 

Our simulation studies show the effect of the MRF prior on the posterior 
inference. In general, as expected, the effect of the prior depends on the 
data and, in particular, on the concordance of the prior network with the 
data. In our simulations, employing the MRF prior allows us to achieve 
a better separation of the relevant pathways from those not relevant (in 
particular, we have found a larger average difference, over three scenarios, 
between the relevant pathway with the lowest posterior probability and the 
nonrelevant pathway with the highest posterior probability). In addition, in 
the simulated setting with fairly small regression coefficients the model with 
the MRF prior was able to select all the correct genes without any false 
positive, while the model without MRF includes 3 false positives. Other 
authors have reported improvements on selection power and sensitivity with 
respect to commonly used procedures that do not use the pathway structure, 
with similar, and in some cases, lower false discovery rates. In addition, in 
our formulation of the model we have used biological information not only for 
prior specification but also to structure the MCMC moves. This is helpful in 
arriving at promising models avoiding visiting invalid configurations. Finally, 
in real data applications, we have found that employing information on gene- 
gene networks can lead to the selection of significant genes that would have 
been missed otherwise, aiding the interpretation of the results, and achieving 
better predictions compared to models that do not treat genes as connected 
elements that work in groups or pathways. 

Several MRF priors for gene selection indicators have been proposed in 
the literature. It is interesting to compare the parametrization of the MRF 
used in this paper and in Li and Zhang (2010) to the parametrization used 
in Wei and Li (2007, 2008), where the prior on 7 is defined as 



(19) 
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where rii is the number of selected genes and noi is the number of edges 
hnking genes with different values of 7^-, that is, edges linking included and 
nonincluded genes among all pathways, 



ni 



p ^ p 



p 

E 



While d plays the same role as fi in (12), the parametrization using g has 
a different effect from rj on the probability of selection of a gene. This is evi- 
dent from the conditional probability P{'yj\-,'-fi,i ^ Nj) = j^^^^^^^ where 
F(7j) = d+ gY^^f^^. {2ji — 1). Higher values of g encourage neighboring genes 
to take on the same 7j value, and, consequently, genes with nonselected 
neighbors have lower prior probability of being selected than genes with no 
neighbors. We felt that parametrization (12) was a better choice for our pur- 
poses. First, in a context of sparsity, where only few nodes are supposed to 
take value 1, a prior that assigns larger probability of inclusion to genes with 
selected neighbors than to isolated genes seems more appropriate. Second, 
the exact simulation algorithm of Propp and Wilson (1996) cannot be used 
to simulate from (19). While any other method to draw from (19) would be 
acceptable, as said by M0ller et al. (2006), Markov chain methods, to sam- 
ple from a MRF, require to check at each step that the chain has converged 
to the equilibrium distribution, to avoid introducing additional undesirable 
stochasticity. On the other hand, one advantage of parametrization (19) is 
that no phase transition problem is associated to the distribution. 

Pathway databases are incomplete and the gene network information is 
often unavailable for many genes. Thus, there may be situations where the 
dependence structure and the MRF prior specification on the gene selection 
indicator, 7, cannot be used for all genes. When the only information avail- 
able is the pathway membership of genes, the prior on 7 could be elicited to 
capture other interesting characteristics. For example, a gene can have a pri- 
ori higher probability of being selected when several pathways that contain it 
are included in the model. We may also want to avoid favoring the selection 
of a large pathway just because of its size. In such cases, conditional on 6, 
independent Bernoulli priors can be specified for 7^ relating the probability 
of selection to the proportion of included pathways that contain gene j, ad- 



justing for the pathway sizes, pk, that is, 7j|0 ~ Bernoulli(c • 



1 ^k^kj 



IPk- 



T.k^i'^hj/pk 
with c a hyperparameter to be elicited. 

In our approach we have made use of PLS components as summary mea- 
sures of the expression of genes belonging to known pathways and then 
applied a fully Bayesian approach for the selection of the pathways to be 
included in the model, and the genes to be included within those pathways. 
Penalized techniques, including lasso [Tibshirani (1996)], elastic net [Zou 
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and Hastie (2005)] and group lasso [Yuan and Lin (2006)] have been stud- 
ied extensively in the literature and have been successfully applied to gene 
expression data. The group lasso, in particular, defines sets of variables, 
then selects either all the variables in the group or none of them. Recently, 
a modification of the method was proposed by Friedman, Hastie and Tib- 
shirani (2010) using a more general penalty that yields sparsity at both the 
group and individual feature levels to select groups and predictors within 
each group. Our understanding of group lasso is that the method works 
best in situations where variables belonging to the same group are highly 
correlated, while covariates in different groups do not exhibit high correla- 
tion. However, genes belonging to the same pathway often do not exhibit 
high correlation in their expression levels. Also, in our case there are genes 
belonging to different pathways that have high correlation, as well as genes 
that belong to more than one pathway. Initial investigations suggest that, in 
terms of prediction MSE, Bayesian formulations of lasso methods perform 
similarly to and, in some cases, better than the frequentist lasso [see, e.g., 
Kyung et al. (2010)]. Particularly relevant to our approach is the work of 
Guan and Stephens (2011), who apply Bayesian variable selection (BVS) 
and stochastic search methods in a regression model for genome- wide data. 
In simulations they find that, in spite of the apparent computational chal- 
lenges, BVS produces better power and predictive performance compared 
with standard lasso techniques. 

SUPPLEMENTARY MATERIAL 

Supplement (DOI: 10.1214/11-AOAS463SUPP; .pdf). Description of the 
MCMC steps for (^,7) and discussion on ergodicity of the Markov chain on 
the restricted space. 
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